
fix: markdown parsing bugs affecting wiki-style content #187

Open
knee5 wants to merge 7 commits into garrytan:master from knee5:fix/wiki-markdown-compat

Conversation

@knee5
Contributor

@knee5 knee5 commented Apr 17, 2026

Summary

  • splitBody mis-splits on horizontal rules: a plain --- in an article body was treated as the compiled_truth/timeline separator, truncating 83% of the articles in a 1,991-article knowledge base (e.g., a 23,887-byte article stored as 593 bytes; 4,856 of 6,680 wikilinks lost from the DB).
  • inferType missing wiki subtype mappings: articles under /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/ all silently defaulted to type='concept', breaking type-filtered queries.
  • Frontmatter double-encoded in Postgres engine: JSON.stringify(x)::jsonb causes postgres.js v3 to store a JSONB string literal instead of an object, making frontmatter->>'key' return NULL and GIN indexes ineffective.

Why this matters

Any wiki/notebook with markdown horizontal rules or non-default subdirectory types hits these. Found while migrating a 1,991-article knowledge base where 83% of articles were truncated in the DB.

Changes

splitBody(): No longer treats plain --- as a sentinel. Recognized split points are now: <!-- timeline --> (preferred, unambiguous), --- timeline --- (decorated separator), or --- only when the next non-empty line is a ## Timeline or ## History heading (backward-compat fallback). serializeMarkdown updated to emit <!-- timeline --> for round-trip stability.
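
The split rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual splitBody: the `SplitResult` shape and the regex names are hypothetical, and it assumes YAML frontmatter has already been stripped from `body`.

```typescript
// Hypothetical result shape; the real function's return type is not shown in this PR.
interface SplitResult {
  compiledTruth: string;
  timeline: string;
}

const HTML_SENTINEL = /^<!--\s*timeline\s*-->$/i;       // preferred, unambiguous
const DECORATED_SENTINEL = /^---\s*timeline\s*---$/i;   // decorated separator
const TIMELINE_HEADING = /^##\s+(Timeline|History)\b/i; // gate for a bare ---

function splitBody(body: string): SplitResult {
  const lines = body.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    let isSplit = HTML_SENTINEL.test(line) || DECORATED_SENTINEL.test(line);
    if (!isSplit && line === "---") {
      // Backward-compat fallback: a bare rule only splits when the next
      // non-empty line is a "## Timeline" or "## History" heading.
      const next = lines.slice(i + 1).find((l) => l.trim() !== "");
      isSplit = next !== undefined && TIMELINE_HEADING.test(next.trim());
    }
    if (isSplit) {
      return {
        compiledTruth: lines.slice(0, i).join("\n").trim(),
        timeline: lines.slice(i + 1).join("\n").trim(),
      };
    }
  }
  // No sentinel found: everything is body, including any horizontal rules.
  return { compiledTruth: body.trim(), timeline: "" };
}
```

With these rules a horizontal rule between two ordinary sections stays inside `compiledTruth`, which is the behavior the truncated articles needed.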

inferType(): Added path-segment mappings for /wiki/analysis/, /wiki/guides/ (and /wiki/guide/), /wiki/hardware/, /wiki/architecture/, and /wiki/concepts/ (and /wiki/concept/). concept remains the default fallback.
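
A path-segment lookup of this kind might look like the sketch below. The `WIKI_SEGMENTS` table and the narrowed `PageType` union are illustrative assumptions; the real union contains more members than the ones named in this PR.

```typescript
// Narrowed for illustration; only the types named in this PR are listed.
type PageType = "concept" | "analysis" | "guide" | "hardware" | "architecture";

// Ordered [path segment, type] pairs. The first matching segment wins.
const WIKI_SEGMENTS: Array<[string, PageType]> = [
  ["/wiki/analysis/", "analysis"],
  ["/wiki/guides/", "guide"],
  ["/wiki/guide/", "guide"],
  ["/wiki/hardware/", "hardware"],
  ["/wiki/architecture/", "architecture"],
  ["/wiki/concepts/", "concept"],
  ["/wiki/concept/", "concept"],
];

function inferType(filePath: string): PageType {
  // Normalize separators and force a leading slash so that a path starting
  // with "wiki/..." still matches the "/wiki/..." segments.
  const normalized = "/" + filePath.replace(/\\/g, "/").replace(/^\/+/, "");
  for (const [segment, type] of WIKI_SEGMENTS) {
    if (normalized.includes(segment)) return type;
  }
  return "concept"; // default fallback, as before
}
```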

postgres-engine JSONB: Replaced all three JSON.stringify(x)::jsonb call sites (putPage, putRawData, logIngest) with this.sql.json(x), which is postgres.js v3's native JSONB serialization. Also fixed rowToChunk in utils.ts to handle embeddings returned as JSON strings (a symptom of the same mismatch). PGLite engine is unaffected.
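
The symptom can be reproduced without a database. The sketch below emulates it in plain TypeScript: `jsonbTypeof` is a hypothetical stand-in for Postgres's `jsonb_typeof`, and the "stored text" values only model what the column would contain. No postgres.js call is involved.

```typescript
const frontmatter = { title: "GPU notes", type: "hardware" };

// Models the buggy write path: JSON text that gets encoded a second time,
// so the stored JSONB value is a string whose content happens to be JSON.
const doubleEncoded = JSON.stringify(JSON.stringify(frontmatter));

// Models a correct write: the stored JSONB value is the object itself.
const singleEncoded = JSON.stringify(frontmatter);

// Hypothetical emulation of jsonb_typeof() over the stored JSON text.
function jsonbTypeof(storedText: string): string {
  const value: unknown = JSON.parse(storedText);
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value; // "object", "string", "number" or "boolean"
}

console.log(jsonbTypeof(doubleEncoded)); // string
console.log(jsonbTypeof(singleEncoded)); // object
```

A JSONB string has no keys, which is why `frontmatter->>'key'` came back NULL for every double-encoded row.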

Testing

854 tests pass. New tests added for: horizontal rules in body, <!-- timeline --> sentinel, heading-gated splits, wiki subtype inference.

Backwards compat

Existing PGLite behavior is unchanged. Postgres engine writes now produce valid JSONB: queries using frontmatter->>'key' that previously returned NULL will start returning values once a row has been rewritten. This is a correctness fix, but downstream code that relied on the NULL behavior should be checked.

garrytan added a commit that referenced this pull request Apr 18, 2026
- splitBody now requires explicit timeline sentinel (<!-- timeline -->,
  --- timeline ---, or --- directly before ## Timeline / ## History).
  A bare --- in body text is a markdown horizontal rule, not a separator.
  This fixes the 83% content truncation @knee5 reported on a 1,991-article
  wiki where 4,856 of 6,680 wikilinks were lost.

- serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability.

- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
  /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
  most-specific-first so projects/blog/writing/essay.md → writing,
  not project.

- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 18, 2026
Two related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
   ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
   on the wire, storing JSONB columns as quoted string literals. Every
   frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
   indexes were inert. Switched to sql.json(value), which is the
   postgres.js-native JSONB encoder (Parameter with OID 3802).
   Affected columns: pages.frontmatter, raw_data.data,
   ingest_log.pages_updated, files.metadata. page_versions.frontmatter
   is downstream via INSERT...SELECT and propagates the fix.

2. pgvector embeddings returning as strings (utils.ts):
   getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
   Float32Array on Supabase, producing [NaN] cosine scores.
   Adds parseEmbedding() helper handling Float32Array, numeric arrays,
   and pgvector string format. Throws loud on malformed vectors
   (per Codex's no-silent-NaN requirement); returns null for
   non-vector strings (treated as "no embedding here"). rowToChunk
   delegates to parseEmbedding.

E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.

Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
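
A rough sketch of a parser with the contract described in item 2 of the commit above follows. It is not the project's parseEmbedding: the helper `toFloat32` is hypothetical, and returning null for null/undefined input or throwing on other types are assumptions of this sketch.

```typescript
// Copies values into a Float32Array, throwing on any non-finite component
// so a bad vector can never turn into silent NaN scores.
function toFloat32(values: unknown[]): Float32Array {
  const out = new Float32Array(values.length);
  values.forEach((v, i) => {
    if (typeof v !== "number" || !Number.isFinite(v)) {
      throw new Error(`Malformed embedding component at index ${i}`);
    }
    out[i] = v;
  });
  return out;
}

function parseEmbedding(value: unknown): Float32Array | null {
  if (value === null || value === undefined) return null; // assumption
  if (value instanceof Float32Array) return value;
  if (Array.isArray(value)) return toFloat32(value);
  if (typeof value === "string") {
    const text = value.trim();
    // Strings that are not bracketed are treated as "no embedding here".
    if (!text.startsWith("[") || !text.endsWith("]")) return null;
    const parts = text.slice(1, -1).split(",");
    // An empty component maps to NaN so that toFloat32 rejects it.
    return toFloat32(parts.map((p) => (p.trim() === "" ? NaN : Number(p))));
  }
  throw new Error(`Unsupported embedding type: ${typeof value}`);
}
```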
garrytan added a commit that referenced this pull request Apr 18, 2026
extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
alongside standard [text](page.md). For wiki KBs where authors omit
leading ../ (thinking in wiki-root-relative terms), resolveSlug
walks ancestor directories until it finds a matching slug.

Without this, wikilinks under tech/wiki/analysis/ targeting
[[../../finance/wiki/concepts/foo]] silently dangled when the
correct relative depth was 3 × ../ instead of 2.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan
Owner

Hey @knee5, thank you for finding and fixing this. The 1,991-article wiki repro made the bug an open-and-shut case. PR #196 (v0.12.1 hotfix) just merged into the queue and re-implements all three of your fixes (splitBody sentinel, inferType wiki subtypes, JSONB triple-fix) with expanded scope:

  • Schema audit found 5 affected JSONB columns, not 3 (added page_versions.frontmatter which is downstream via INSERT...SELECT, plus files.metadata from a separate command path)
  • New gbrain repair-jsonb command auto-fixes existing rows on upgrade via the v0_12_1 migration orchestrator
  • New test/e2e/postgres-jsonb.test.ts round-trips all 5 sites against real Postgres — the regression test that should have caught this originally
  • CI grep guard at scripts/check-jsonb-pattern.sh to prevent the bug class from sneaking back in
  • Wikilink extraction ([[page]] syntax) and resolveSlug ancestor search came over verbatim

Re-implemented rather than cherry-picked because PR #187 went CONFLICTING after Minions v7 / knowledge graph landed. Your test cases are ported verbatim. Co-authorship preserved in the commit trailers.

Mind if I close this PR in favor of #196? Want to make sure you're OK with the merger before I do — your name's on every relevant commit and the CHANGELOG. Thanks again.

Clevin Canales and others added 6 commits April 18, 2026 15:11
Plain `---` in article body was treated as compiled_truth/timeline separator.
Wikis using `---` as horizontal rules between sections experienced severe
content truncation — a 23,887-byte article could store as 593 bytes, and
4,856 of 6,680 wikilinks were lost from the DB (73%) across a 1,991-article
knowledge base.

Now splits only on explicit `<!-- timeline -->` sentinel, `--- timeline ---`,
or `---` when immediately followed by a `## Timeline` or `## History` heading.
serializeMarkdown updated to emit `<!-- timeline -->` for round-trip stability.

Tests added for horizontal rules, sentinel splits, and heading-gated splits.

Only `/wiki/concepts/` was mapped; articles under `/wiki/analysis/`,
`/wiki/guides/`, `/wiki/hardware/`, and `/wiki/architecture/` all
silently defaulted to `type='concept'`, producing incorrect metadata
and breaking any type-filtered queries.

Adds explicit path-segment mappings for the four missing subtypes.
`concept` remains the default fallback.
…gify()::jsonb

The Postgres engine was passing `JSON.stringify(x)::jsonb` to postgres.js.
Because postgres.js v3 sends that as a plain string, the DB stores a JSONB
value that is itself a JSON string literal — not an object. Consequently
`frontmatter->>'key'` returns NULL in SQL and GIN indexes are ineffective.

Replace all three call sites (putPage, putRawData, logIngest) with
`this.sql.json(x)`, which is postgres.js v3's native JSONB serialization
and causes the driver to send the value with the correct wire type.

Also fix rowToChunk in utils.ts to handle embeddings returned as JSON
strings (a related symptom of the same driver/cast mismatch).

PGLite engine is unaffected — it uses `$n::jsonb` with JSON.stringify,
which is correct for that driver.
The extract command's regex only matched standard markdown links
`[text](path.md)`, missing the `[[path|display]]` wikilinks used by
Obsidian-style knowledge bases. A 2,000-article vault with thousands of
wikilinks extracted 0 links because of this.

Now handles both syntaxes:
- Standard markdown: `[text](relative/path.md)`
- Wikilinks: `[[path/to/page]]` and `[[path/to/page|Display Text]]`

Skips external URLs in both cases. Normalizes wikilink targets to
include .md suffix when missing.

Note: target-slug resolution for wikilinks still needs refinement —
relative paths like `[[concepts/foo]]` don't map cleanly to DB slugs
like `tech/wiki/concepts/foo` without context. Tracked for follow-up.

Tests added for wikilink patterns, display text handling, external
URL filtering.
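
The two link syntaxes from this commit could be matched along the lines of the sketch below. The function name `extractLinks`, the `ExtractedLink` shape, and the exact regexes are assumptions made for illustration; the project's extractMarkdownLinks may differ in detail.

```typescript
interface ExtractedLink {
  target: string; // relative path, always ending in .md
  text: string;   // link text or wikilink display text
}

// Anything starting with a URL scheme ("https:", "mailto:", ...) is external.
const EXTERNAL = /^[a-z][a-z0-9+.-]*:/i;

function extractLinks(content: string): ExtractedLink[] {
  const links: ExtractedLink[] = [];
  let m: RegExpExecArray | null;

  // Standard markdown: [text](relative/path.md)
  const markdown = /\[([^\]]+)\]\(([^)\s]+)\)/g;
  while ((m = markdown.exec(content)) !== null) {
    const target = m[2];
    if (!EXTERNAL.test(target) && target.endsWith(".md")) {
      links.push({ target, text: m[1] });
    }
  }

  // Wikilinks: [[path/to/page]] and [[path/to/page|Display Text]]
  const wikilink = /\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/g;
  while ((m = wikilink.exec(content)) !== null) {
    const raw = m[1].trim();
    if (EXTERNAL.test(raw)) continue;
    // Normalize the target to include the .md suffix when it is missing.
    const target = raw.endsWith(".md") ? raw : `${raw}.md`;
    links.push({ target, text: (m[2] ?? raw).trim() });
  }
  return links;
}
```

Markdown links are collected first and wikilinks second, so the result is grouped by syntax rather than ordered by position in the document.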
…slugs

Wikilinks in wiki-style KBs use various formats that the previous extractor
failed to resolve, dropping ~30% of valid links:

- Relative bare name: [[foo]] in tech/wiki/concepts/ → tech/wiki/concepts/foo
- Cross-type shorthand: [[analysis/foo]] in tech/wiki/guides/ → tech/wiki/analysis/foo
  (authors omit leading ../ thinking in wiki-root-relative terms)
- Cross-domain under-specified: [[../../finance/wiki/...]] from depth-3 dirs
  resolves one level short because authors write 2× ../ when 3× is needed to
  reach KB root — ancestor search corrects this
- Fully-qualified: [[tech/wiki/concepts/foo]] — now handled by root fallback
- Section anchors: [[page#section]] — now stripped; bare [[#anchor]] skipped

Adds resolveSlug(fileDir, relTarget, allSlugs) that first tries the standard
path.join resolution, then progressively strips leading path components from
fileDir (ancestor search) until a matching slug is found. Returns null for
genuinely dangling targets (no matching page exists anywhere in the KB).

Also strips section anchors (#heading) from wikilink paths in
extractMarkdownLinks — they're intra-page refs and were causing lookup misses.

Analysis on the user's 2,074-page KB:
- Previously resolved: 6,760 raw / 5,039 unique deduped disk links
- After fix: 8,594 raw / 6,641 unique deduped disk links (+32% unique)
- Remaining 1,241 raw links are genuinely dangling (no matching page)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
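
One way to read the ancestor search described in this commit is sketched below. It keeps the `resolveSlug(fileDir, relTarget, allSlugs)` signature from the commit message, but the body is an assumption: it tries the standard resolution first, then each ancestor directory of `fileDir`, ending at the KB root. The `normalize` helper is hypothetical and stands in for path.join so the block needs no imports.

```typescript
// Collapses "." and ".." segments. Returns null if the path would climb
// above the KB root.
function normalize(path: string): string | null {
  const out: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(part);
    }
  }
  return out.join("/");
}

function resolveSlug(
  fileDir: string,
  relTarget: string,
  allSlugs: Set<string>
): string | null {
  const target = relTarget.replace(/\.md$/, "");
  const dirParts = fileDir.split("/").filter((p) => p !== "");
  // depth === dirParts.length is the standard resolution; smaller depths
  // are ancestors of fileDir, and depth 0 is the KB root fallback.
  for (let depth = dirParts.length; depth >= 0; depth--) {
    const base = dirParts.slice(0, depth).join("/");
    const candidate = normalize(`${base}/${target}`);
    if (candidate !== null && allSlugs.has(candidate)) return candidate;
  }
  return null; // genuinely dangling
}
```

Because the closest directory is tried first, a page that exists next to the linking file wins over a same-named page higher up the tree.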
… pages

Surfaces pages with zero inbound wikilinks. Essential for content
enrichment cycles in KBs with 1000+ pages. By default filters out
auto-generated pages, raw sources, and pseudo-pages, where having no inbound
links is expected; pass --include-pseudo to disable the filter.

Supports text (grouped by domain), --json, --count outputs.
Also exposed as find_orphans MCP operation.

Tests cover basic detection, filtering, all output modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
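
The core of orphan detection is a set difference between all pages and the pages that receive at least one link. The sketch below illustrates that idea only: the `Page` and `Link` shapes, the `findOrphans` name, and the `PSEUDO_PATTERNS` list are hypothetical and do not reproduce the command's real filter rules.

```typescript
interface Page {
  slug: string;
  title: string;
}

interface Link {
  from: string;
  to: string;
}

// Hypothetical exclusion rules for pages where zero inbound links is normal.
const PSEUDO_PATTERNS: RegExp[] = [/^raw\//, /^scratch\//, /\/_index$/];

function findOrphans(
  pages: Page[],
  links: Link[],
  includePseudo = false
): Page[] {
  // Every slug that is the target of at least one link.
  const linked = new Set(links.map((l) => l.to));
  return pages.filter((p) => {
    if (linked.has(p.slug)) return false;
    if (includePseudo) return true;
    return !PSEUDO_PATTERNS.some((re) => re.test(p.slug));
  });
}
```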
@knee5 knee5 force-pushed the fix/wiki-markdown-compat branch from 25e1c56 to f50954f Compare April 18, 2026 19:24
…n canonical extractor

extractEntityRefs now recognizes both syntaxes equally:
  [Name](people/slug)      -- upstream original
  [[people/slug|Name]]     -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by
Karpathy-style knowledge bases:
  - entities  (legacy prefix some brains keep during migration)
  - projects  (gbrain canonical, was missing from regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted
zero auto-links on put_page because the regex only matched markdown-style
[name](path). After: 1,377 new typed edges on a single extract --source db
pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already
handled wikilinks as of the wiki-markdown-compat fix wave), so the db and
fs sources now produce the same link graph from the same content.

Both patterns share the DIR_PATTERN constant so adding a new entity dir
only requires updating one string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
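
The shared-constant design can be sketched as below. The `DIR_PATTERN` value lists only directories named in this PR, so it is a partial stand-in for the real constant, and the `EntityRef` shape and regex details are assumptions.

```typescript
interface EntityRef {
  slug: string;
  name: string;
}

// Partial, illustrative list. Both regexes are built from this one string,
// so adding an entity dir means editing a single place.
const DIR_PATTERN =
  "people|companies|meetings|entities|projects|tech|finance|personal|openclaw";

function extractEntityRefs(content: string): EntityRef[] {
  const refs: EntityRef[] = [];
  let m: RegExpExecArray | null;

  // [Name](people/slug)
  const markdown = new RegExp(
    `\\[([^\\]]+)\\]\\(((?:${DIR_PATTERN})/[^)\\s]+)\\)`,
    "g"
  );
  while ((m = markdown.exec(content)) !== null) {
    refs.push({ slug: m[2], name: m[1] });
  }

  // [[people/slug|Name]] or [[people/slug]]
  const wikilink = new RegExp(
    `\\[\\[((?:${DIR_PATTERN})/[^\\]|]+)(?:\\|([^\\]]+))?\\]\\]`,
    "g"
  );
  while ((m = wikilink.exec(content)) !== null) {
    const slug = m[1].trim();
    refs.push({ slug, name: (m[2] ?? slug).trim() });
  }
  return refs;
}
```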
@knee5
Contributor Author

knee5 commented Apr 18, 2026

Hey @garrytan — thank you, better outcome than I hoped for. The expanded JSONB audit (page_versions.frontmatter + files.metadata), the gbrain repair-jsonb migration orchestrator, the round-trip E2E test, and the CI grep guard are exactly the kind of scope tightening I didn't know to do. Test cases ported with co-authorship is generous.

Yes, please close #187 in favor of #196.

Two scope items from the original PR aren't in v0.12.0, and I'd like to offer them as focused follow-up PRs if you want them:

  1. gbrain orphans command — returns the actual list of under-linked page slugs for enrichment prioritization, complementary to gbrain doctor. doctor reports the aggregate ("link coverage 85%, timeline 17%"); orphans returns the slug + title + domain list that drives which pages to enrich next. The filtering logic is the load-bearing part — excludes raw/, scratch/, pseudo-pages, entities/*/_index, auto-generated pages. On my 2,146-page brain the unfiltered count is ~1,247 (every raw podcast transcript); the filtered count is 3, and those 3 are real gaps worth acting on. Flags: --count, --json, --include-pseudo. Comes with test/orphans.test.ts. If you'd rather this be a doctor subcommand flag (doctor --orphans) I can reshape.

  2. Wikilink + domain-slug support in v0.12.0's link-extraction.ts canonical extractor. extractEntityRefs currently matches only [name](path) markdown. Karpathy-style wiki brains (mine, likely Wintermute's too) use [[page]] and [[page|display]] as the dominant link form — so the auto-link post-hook on put_page sees nothing. One commit adds:

    • Obsidian wikilink regex alongside the markdown regex in extractEntityRefs (both regexes driven by a shared DIR_PATTERN constant, so adding a new entity dir only touches one string)
    • DIR_PATTERN extended with entities|tech|finance|personal|openclaw so domain-organized wikis participate in the graph layer alongside the canonical people|companies|meetings|... dirs

    After enabling it on my brain: gbrain extract links --source db went from extracting nothing new to materializing +1,377 typed edges on a single pass. Markdown-link extraction behavior is unchanged; this adds a parallel wikilink code path and replaces nothing. Backward compatible.

Either/both sound useful, or should I drop them? Happy to open a clean PR against v0.12.0 master if you want to evaluate. Thanks again for the thorough treatment — I'll upgrade and run gbrain repair-jsonb when v0.12.1 tags.

garrytan added a commit that referenced this pull request Apr 18, 2026
…196)

* fix: splitBody and inferType for wiki-style markdown content

- splitBody now requires explicit timeline sentinel (<!-- timeline -->,
  --- timeline ---, or --- directly before ## Timeline / ## History).
  A bare --- in body text is a markdown horizontal rule, not a separator.
  This fixes the 83% content truncation @knee5 reported on a 1,991-article
  wiki where 4,856 of 6,680 wikilinks were lost.

- serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability.

- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
  /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
  most-specific-first so projects/blog/writing/essay.md → writing,
  not project.

- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

Two related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
   ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
   on the wire, storing JSONB columns as quoted string literals. Every
   frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
   indexes were inert. Switched to sql.json(value), which is the
   postgres.js-native JSONB encoder (Parameter with OID 3802).
   Affected columns: pages.frontmatter, raw_data.data,
   ingest_log.pages_updated, files.metadata. page_versions.frontmatter
   is downstream via INSERT...SELECT and propagates the fix.

2. pgvector embeddings returning as strings (utils.ts):
   getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
   Float32Array on Supabase, producing [NaN] cosine scores.
   Adds parseEmbedding() helper handling Float32Array, numeric arrays,
   and pgvector string format. Throws loud on malformed vectors
   (per Codex's no-silent-NaN requirement); returns null for
   non-vector strings (treated as "no embedding here"). rowToChunk
   delegates to parseEmbedding.

E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.

Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extract wikilink syntax with ancestor-search slug resolution

extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
alongside standard [text](page.md). For wiki KBs where authors omit
leading ../ (thinking in wiki-root-relative terms), resolveSlug
walks ancestor directories until it finds a matching slug.

Without this, wikilinks under tech/wiki/analysis/ targeting
[[../../finance/wiki/concepts/foo]] silently dangled when the
correct relative depth was 3 × ../ instead of 2.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

- New gbrain repair-jsonb command. Detects rows where
  jsonb_typeof(col) = 'string' and rewrites them via
  (col #>> '{}')::jsonb across 5 affected columns:
  pages.frontmatter, raw_data.data, ingest_log.pages_updated,
  files.metadata, page_versions.frontmatter. Idempotent — re-running
  is a no-op. PGLite engines short-circuit cleanly (the bug never
  affected the parameterized encode path PGLite uses). --dry-run
  shows what would be repaired; --json for scripting.

- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
  Modeled on v0_12_0 pattern, registered in migrations/index.ts.
  Runs automatically via gbrain upgrade / apply-migrations.

- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if
  anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation
  pattern. Wired into bun test via package.json. Best-effort static
  analysis (multi-line and helper-wrapped variants are caught by the
  E2E round-trip test instead).

- Updates apply-migrations.test.ts expectations to account for the new
  v0.12.1 entry in the registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
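
The repair rewrite in the repair-jsonb item above can be illustrated by generating one UPDATE per affected column. This is a sketch only: `repairStatement`, `repairStatements`, and `repairValue` are hypothetical names, and the real command also handles dry runs, JSON output, and engine detection, none of which is modeled here.

```typescript
// The five columns listed in the commit message.
const AFFECTED_COLUMNS: Array<[string, string]> = [
  ["pages", "frontmatter"],
  ["raw_data", "data"],
  ["ingest_log", "pages_updated"],
  ["files", "metadata"],
  ["page_versions", "frontmatter"],
];

// #>> '{}' extracts the JSONB string's text content, and the cast re-parses
// that text as JSONB. The WHERE clause only matches double-encoded rows, so
// a second run finds nothing left to rewrite.
function repairStatement(table: string, column: string): string {
  return `UPDATE ${table} SET ${column} = (${column} #>> '{}')::jsonb WHERE jsonb_typeof(${column}) = 'string'`;
}

function repairStatements(): string[] {
  return AFFECTED_COLUMNS.map(([table, column]) => repairStatement(table, column));
}

// The same unwrap step emulated on an in-memory value, to show idempotence.
function repairValue(stored: unknown): unknown {
  return typeof stored === "string" ? JSON.parse(stored) : stored;
}
```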

* chore: bump version and changelog (v0.12.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

- CLAUDE.md: document repair-jsonb command, v0_12_1 migration,
  splitBody sentinel contract, inferType wiki subtypes, CI grep
  guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add
  v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on
  Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with
  migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction
  via gbrain extract links (v0.12.1+)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 19, 2026
Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 19, 2026
…kilinks, orphans (#216)

* fix(sync): remove nested transaction that deadlocks > 10 file syncs

sync.ts wraps the add/modify loop in engine.transaction(), and each
importFromContent inside opens another one. PGLite's
_runExclusiveTransaction is a non-reentrant mutex — the second call
queues on the mutex the first is holding, and the process hangs forever
in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched
runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct
anyway (one file's failure should not roll back the others).

(cherry picked from commit 4a1ac00)

* test(sync): regression guard for #132 top-level engine.transaction wrap

Reads src/commands/sync.ts verbatim and asserts no uncommented
engine.transaction() call appears above the add/modify loop. Protects
against silent reintroduction of the nested-mutex deadlock that hung
> 10-file syncs forever in ep_poll.

* feat(utils): tryParseEmbedding() skip+warn sibling for availability path

parseEmbedding() throws on structural corruption — right call for ingest/
migrate paths where silent skips would be data loss. Wrong call for
search/rescore paths where one corrupt row in 10K would kill every
query that touches it.

tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on
any shape that would throw, warns once per session so the bad row is
visible in logs. Use it anywhere we'd rather degrade ranking than blow
up the whole query.

Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call
site) — the 5-line rescore loop was the direct motivator. Keep the
throwing parseEmbedding() for everything else (pglite-engine rowToChunk,
migrate-engine round-trips, ingest).
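
The skip-and-warn wrapper could be sketched as follows. The `parseEmbedding` here is a deliberately small stand-in, and the module-level `warned` flag and injectable `warn` callback are assumptions about how "warns once per session" might be implemented.

```typescript
// Minimal stand-in for the throwing parser (not the project's version).
function parseEmbedding(value: unknown): Float32Array | null {
  if (value instanceof Float32Array) return value;
  if (typeof value !== "string" || !value.trim().startsWith("[")) return null;
  const parts = value.trim().replace(/^\[|\]$/g, "").split(",");
  const numbers = parts.map((p) => (p.trim() === "" ? NaN : Number(p)));
  if (numbers.some((n) => !Number.isFinite(n))) {
    throw new Error("malformed embedding vector");
  }
  return Float32Array.from(numbers);
}

let warned = false; // one warning per process, however many rows are bad

function tryParseEmbedding(
  value: unknown,
  warn: (message: string) => void = (message) => console.warn(message)
): Float32Array | null {
  try {
    return parseEmbedding(value);
  } catch (err) {
    if (!warned) {
      warned = true;
      const reason = err instanceof Error ? err.message : String(err);
      warn(`Skipping unparseable embedding: ${reason}`);
    }
    return null; // degrade ranking instead of failing the whole query
  }
}
```

A search path can then skip rows where the result is null, while ingest and migration code keeps calling the throwing parser directly.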

* postgres-engine: scope search statement_timeout to the transaction

searchKeyword and searchVector run on a pooled postgres.js client
(max: 10 by default). The original code bounded each search with

  await sql`SET statement_timeout = '8s'`
  try { await sql`<query>` }
  finally { await sql`SET statement_timeout = '0'` }

but every tagged template is an independent round-trip that picks an
arbitrary connection from the pool. The SET, the query, and the reset
could all land on DIFFERENT connections. In practice the GUC sticks
to whichever connection ran the SET and then gets returned to the
pool — the next unrelated caller on that connection inherits the 8s
timeout (clipping legitimate long queries) or the reset-to-0 (disabling
the guard for whoever expected it). A crash in the middle leaves the
state set permanently.

Wrap each search in sql.begin(async sql => …). postgres.js reserves
a single connection for the transaction body, so the SET LOCAL, the
query, and the implicit COMMIT all run on the same connection. SET
LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores
the previous value automatically, regardless of the code path out.
Error paths can no longer leak the GUC.

No API change. Timeout value and semantics are identical (8s cap on
search queries, no effect on embed --all / bulk import which runs
outside these methods). Only one transaction per search — BEGIN +
COMMIT round-trips are negligible next to a ranked FTS or pgvector
query.

Also closes the earlier audit finding R4-F002 which reported the same
pattern on searchKeyword. This PR covers both searchKeyword and
searchVector so the pool-leak class is fully closed.

Tests (test/postgres-engine.test.ts, new file):
- No bare SET statement_timeout remains after stripping comments.
- searchKeyword and searchVector each wrap their query in sql.begin.
- Both use SET LOCAL.
- Neither explicitly clears the timeout with SET statement_timeout=0.

Source-level guardrails keep the fast unit suite DB-free. Live
Postgres coverage of the search path is in test/e2e/search-quality.test.ts,
which continues to exercise these methods end-to-end against
pgvector when DATABASE_URL is set.

(cherry picked from commit 6146c3b)
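
The transaction-scoped pattern from this commit looks roughly like the sketch below. To stay self-contained it uses two small structural types, `Query` and `Client`, that model only the parts of a postgres.js client used here; they are assumptions, as are the function name and the query's table and column names.

```typescript
// Tagged-template query function, as handed to the transaction body.
type Query = (
  strings: TemplateStringsArray,
  ...values: unknown[]
) => Promise<unknown[]>;

// Only the begin() method is modeled. It runs the body on one connection.
interface Client {
  begin<T>(body: (tx: Query) => Promise<T>): Promise<T>;
}

async function searchWithTimeout(sql: Client, term: string): Promise<unknown[]> {
  return sql.begin(async (tx) => {
    // SET LOCAL lasts until COMMIT or ROLLBACK, so the timeout cannot leak
    // to whoever uses this pooled connection next.
    await tx`SET LOCAL statement_timeout = '8s'`;
    // Hypothetical query; the real search SQL is not shown in this PR.
    return tx`SELECT slug FROM pages WHERE search_vector @@ plainto_tsquery(${term})`;
  });
}
```

No finally block is needed to reset the timeout, because ending the transaction restores the previous value on every exit path.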

* feat(orphans): add gbrain orphans command for finding under-connected pages

Surfaces pages with zero inbound wikilinks. Essential for content
enrichment cycles in KBs with 1000+ pages. By default filters out
auto-generated pages, raw sources, and pseudo-pages, where having no inbound
links is expected; pass --include-pseudo to disable the filter.

Supports text (grouped by domain), --json, --count outputs.
Also exposed as find_orphans MCP operation.

Tests cover basic detection, filtering, all output modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit f50954f)

* feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor

extractEntityRefs now recognizes both syntaxes equally:
  [Name](people/slug)      -- upstream original
  [[people/slug|Name]]     -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by
Karpathy-style knowledge bases:
  - entities  (legacy prefix some brains keep during migration)
  - projects  (gbrain canonical, was missing from regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted
zero auto-links on put_page because the regex only matched markdown-style
[name](path). After: 1,377 new typed edges on a single extract --source db
pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already
handled wikilinks as of the wiki-markdown-compat fix wave), so the db and
fs sources now produce the same link graph from the same content.

Both patterns share the DIR_PATTERN constant so adding a new entity dir
only requires updating one string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 1cfb156)

* feat(doctor): jsonb_integrity + markdown_body_completeness detection

Add two v0.12.1-era reliability checks to `gbrain doctor`:

- `jsonb_integrity` scans the 4 known write sites from the v0.12.0
  double-encode bug (pages.frontmatter, raw_data.data,
  ingest_log.pages_updated, files.metadata) and reports rows where
  jsonb_typeof(col) = 'string'. The fix hint points at
  `gbrain repair-jsonb` (the standalone repair command shipped in
  v0.12.1).

- `markdown_body_completeness` flags pages whose compiled_truth is
  <30% of the raw source content length when raw has multiple H2/H3
  boundaries. Heuristic only; suggests `gbrain sync --force` or
  `gbrain import --force <slug>`.

Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage
that should have caught the original double-encode bug. Hits all four
write sites against real Postgres and asserts jsonb_typeof='object'
plus `->>'key'` returns the expected scalar.

Detection only: doctor diagnoses, `gbrain repair-jsonb` treats.
No overlap with the standalone repair path.
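
The `markdown_body_completeness` heuristic could be approximated as in the sketch below. The function name `looksTruncated` is hypothetical, and reading "multiple H2/H3 boundaries" as "at least two H2 or H3 headings" is an assumption.

```typescript
function looksTruncated(rawSource: string, compiledTruth: string): boolean {
  // Count H2/H3 headings in the raw source.
  const headings = rawSource
    .split("\n")
    .filter((line) => /^#{2,3}\s/.test(line)).length;
  // Short single-section notes are never flagged.
  if (headings < 2) return false;
  // Flag pages whose stored body is under 30% of the raw length.
  return compiledTruth.length < 0.3 * rawSource.length;
}
```

The 23,887-byte article stored as 593 bytes from the original report would be flagged by this check, provided its raw source has at least two such headings.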

* chore: bump to v0.12.3 + changelog (reliability wave)

Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: update project documentation for v0.12.3

CLAUDE.md:
- Add src/commands/orphans.ts entry
- Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity +
  markdown_body_completeness check descriptions
- Update src/core/link-extraction.ts to mention Obsidian wikilinks +
  extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw)
- Update src/core/utils.ts to mention tryParseEmbedding sibling
- Update src/core/postgres-engine.ts to note statement_timeout scoping +
  tryParseEmbedding usage in getEmbeddingsByChunkIds
- Add Key commands added in v0.12.3 section (orphans, doctor checks)
- Add test/orphans.test.ts, test/postgres-engine.test.ts, updated
  descriptions for test/sync.test.ts, test/doctor.test.ts,
  test/utils.test.ts
- Add test/e2e/jsonb-roundtrip.test.ts with note on intentional overlap
- Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3)

README.md:
- Add gbrain orphans to ADMIN commands block

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: sunnnybala <dhruvagarwal5018@gmail.com>
Co-authored-by: Gustavo Aragon <gustavoraularagon@gmail.com>
Co-authored-by: Clevin Canales <clevin@Clevins-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Clevin Canales <clev.canales@gmail.com>